gaussian processes
Thus far, our MLMs have grouped observations within essentially nominal categories – even when we use numbers, we’re treating these categories as unordered and discrete. A participant id is just a label for a unique “thing”. In these models, partial pooling does double duty – it improves accuracy by borrowing information from other groups, and it estimates variation across groups.
But consider other types of grouping that are not so distinct – McElreath shows examples using spatial distance, but also consider time or age. Individuals of the same age share some exposures (cultural trends, political and historical events, even climate); they also share exposures with people of similar ages. And even though we assign labels to generations, the similarity between two people is probably better captured by the difference in their ages than by whether they share a generational label.
In this case, it wouldn’t make as much sense to fit a separate intercept for individuals of the same age, because the model would borrow equally from all other groups. Instead, we would rather the model borrow information in proportion to the closeness of other groups (e.g., intercepts for 27-year-olds should be more informed by data from 28- and 26-year-olds than 67-year-olds).
The general approach to this is known as GAUSSIAN PROCESS REGRESSION.
Gaussian processes can be useful for answering questions like:
From the brms manual:
A GP is a stochastic process, which describes the relation between one or more predictors \(x = (x_1, ..., x_d)\) and a response \(f(x)\), where \(d\) is the number of predictors. A GP is the generalization of the multivariate normal distribution to an infinite number of dimensions. Thus, it can be interpreted as a prior over functions. The values of \(f()\) at any finite set of locations are jointly multivariate normal, with a covariance matrix defined by the covariance kernel \(k_p (x_i,x_j)\), where \(p\) is the vector of parameters of the GP:
\[ f(x_1),...,f(x_n) \sim \text{MVN}(0, (k_p(x_i,x_j))^n_{i,j=1}) \]
The smoothness and general behavior of the function \(f\) depends only on the choice of the covariance kernel. In plain English, a kernel is like a “relationship measurer” between any two points in your data. It answers the question: “If I know the value at point A, how much does that tell me about the value at point B?”
The kernel determines how the influence spreads across your data. If two points are close together according to the kernel, their values will be similar. If two points are far apart, their values will be uncorrelated (but not necessarily far apart).
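We can see the “prior over functions” idea directly by simulating: draw a vector of \(f(x)\) values from a multivariate normal whose covariance matrix comes from a kernel. A minimal sketch, assuming the MASS package and an exponentiated quadratic kernel with arbitrary \(\sigma = 1\) and \(l = 1\):

```r
library(MASS)  # for mvrnorm()

set.seed(1)
x <- seq(0, 5, by = 0.1)

# exponentiated quadratic covariance matrix (sigma = 1, l = 1, arbitrary)
K <- outer(x, x, function(xi, xj) exp(-(xi - xj)^2 / 2))
K <- K + diag(1e-8, length(x))  # tiny jitter for numerical stability

# three draws from the GP prior: each row is one random function f(x)
f <- mvrnorm(n = 3, mu = rep(0, length(x)), Sigma = K)
matplot(x, t(f), type = "l", lty = 1, ylab = "f(x)")
```

Each draw is a smooth curve: nearby x-values get highly correlated function values, which is exactly what the kernel encodes.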
Different kernels create different patterns of relationships. The mathematical formula of the kernel determines exactly how this similarity decays with distance, time, or whatever dimension you’re working with.
Currently there are four available kernels in brms (in order from smoothest to roughest):
* exp_quad: exponentiated quadratic
* matern52: Matern 5/2
* matern32: Matern 3/2
* exponential: exponential

(Details about the kernels are from the Stan manual.)
With magnitude \(\sigma\) and length scale \(l\), the exponentiated quadratic kernel is:
\[ k(x_i, x_j) = \sigma^2\text{exp}(-\frac{|x_i-x_j|^2}{2l^2}) \]

* Think of it as creating very gentle hills and valleys - no sudden jumps or sharp corners
* Influence between points drops off quickly (like a bell curve)
* If you know the value at age 25, it gives you strong information about age 24 and 26, moderate information about age 22 and 28, and very little about age 15 or 35
* Best for processes you believe are inherently smooth and continuous
* Example: Physical growth patterns in children might follow this kind of smooth curve
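To make that distance decay concrete, here is a quick numerical sketch. The values of \(\sigma = 1\) and \(l = 3\) are arbitrary choices for illustration:

```r
# Exponentiated quadratic kernel (sigma and l are hypothetical values)
k_expquad <- function(xi, xj, sigma = 1, l = 3) {
  sigma^2 * exp(-abs(xi - xj)^2 / (2 * l^2))
}

k_expquad(25, 26)  # close to 1: strong information about an adjacent age
k_expquad(25, 28)  # moderate information
k_expquad(25, 35)  # near 0: very little information
```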
This creates somewhat rougher patterns than the exponentiated quadratic.
With magnitude \(\sigma\) and length scale \(l\), the Matern 3/2 kernel is:
\[ k(x_i,x_j) = \sigma^2(1 + \frac{\sqrt{3}|x_i-x_j|}{l})\text{exp}(- \frac{\sqrt{3}|x_i-x_j|}{l}) \]
This sits between the very smooth exp_quad and the rougher matern32.
With magnitude \(\sigma\) and length scale \(l\), the Matern 5/2 kernel is:

\[ k(x_i,x_j) = \sigma^2(1 + \frac{\sqrt{5}|x_i-x_j|}{l}+ \frac{5|x_i-x_j|^2}{3l^2})\text{exp}(- \frac{\sqrt{5}|x_i-x_j|}{l}) \]
* More flexible than exp_quad while still maintaining good continuity

This is the roughest of the four kernels.
With magnitude \(\sigma\) and length scale \(l\), the exponential kernel is:
\[ k(x_i, x_j) = \sigma^2\text{exp}(-\frac{|x_i-x_j|}{l}) \]
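One way to compare the four kernels is to compute how each one’s covariance decays over the same set of distances. A sketch with arbitrary \(\sigma = 1\) and \(l = 2\) (the object is named `d_` here to avoid clobbering a data frame named `d`):

```r
# Covariance decay for each brms kernel at common hypothetical sigma = 1, l = 2
d_    <- 0:6  # distances
sigma <- 1
l     <- 2

exp_quad    <- sigma^2 * exp(-d_^2 / (2 * l^2))
matern52    <- sigma^2 * (1 + sqrt(5) * d_ / l + 5 * d_^2 / (3 * l^2)) *
  exp(-sqrt(5) * d_ / l)
matern32    <- sigma^2 * (1 + sqrt(3) * d_ / l) * exp(-sqrt(3) * d_ / l)
exponential <- sigma^2 * exp(-d_ / l)

round(data.frame(distance = d_, exp_quad, matern52, matern32, exponential), 3)
```

All four start at \(\sigma^2\) when the distance is zero; they differ in how the covariance falls off, which is what produces smoother or rougher functions.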
Data come from a 26-wave (every 2 weeks) study of political attitudes (Brandt et al., 2021).
Should federal spending on defense be increased, decreased, or kept the same?
d %>%
  count(def) %>%
  mutate(
    col = ifelse(def > 7, "1", "2"),
    def = factor(def,
                 levels = 1:9,
                 labels = c("1\nGreatly decrease", "2", "3", "4\nKeep the same",
                            "5", "6", "7\nGreatly increase", "8\nDon't Know",
                            "9\nHaven't thought"))) %>%
  ggplot(aes(x = def, y = n)) +
  geom_col(aes(fill = col)) +
  scale_x_discrete(labels = label_wrap(10)) +
  labs(title = "Should federal spending on defense be changed?",
       x = NULL,
       y = "count") +
  theme(legend.position = "none")

Let’s remove those values on the variable that are not part of our Likert scale.
We’ll be adding age at baseline to the model. Age was only collected during wave 1, so we’ll simply carry that value through all of a person’s survey responses. There are a few ways to do that; I’ll show you how to use the fill function from the tidyverse. We’ll also remove anyone who didn’t report an age.
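Here is a toy sketch of that wrangling step, assuming columns named `id`, `wave`, and `age` (the toy tibble is only for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data: age recorded only at wave 1; person 2 never reported an age
d_toy <- tibble(
  id   = rep(1:2, each = 3),
  wave = rep(1:3, times = 2),
  age  = c(25, NA, NA, NA, NA, NA)
)

d_toy <- d_toy %>%
  group_by(id) %>%
  arrange(wave, .by_group = TRUE) %>%  # wave 1 comes first within each person
  fill(age, .direction = "down") %>%   # carry the baseline age forward
  ungroup() %>%
  filter(!is.na(age))                  # drop anyone who never reported an age
```

`fill(.direction = "down")` replaces each `NA` with the most recent non-missing value within the group, which is exactly the carry-forward behavior we want here.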
Let’s say we’re interested in studying how time and age impact the responses to this question. We’ll build up our models from most simple to most complex.
\[\begin{align*} R_{ij} &\sim \text{Categorical}(p) \\ \text{logit}(p_k) &= \alpha_{k,j} - \phi \\ \phi &= 0 \\ \alpha_{k,j} &= \bar{\alpha}_k + U_{kj} \\ \bar{\alpha}_k &\sim \text{Normal}(0, 1.5) \\ U_{kj} &\sim \text{Normal}(0, \sigma_{\alpha_k}) \\ \sigma_{\alpha_k} &\sim \text{Exponential}(1) \end{align*}\]
\[\begin{align*} R_{ij} &\sim \text{Categorical}(p) \\ \text{logit}(p_k) &= \alpha_{k,j} - \phi \\ \phi &= \beta_1\text{age}_i + \beta_2\text{wave}_i \\ \alpha_{k,j} &= \bar{\alpha}_k + U_{kj} \\ \bar{\alpha}_k &\sim \text{Normal}(0, 1.5) \\ U_{kj} &\sim \text{Normal}(0, \sigma_{\alpha_k}) \\ \beta_1 &\sim \text{Normal}(0, 1) \\ \beta_2 &\sim \text{Normal}(0, 1) \\ \sigma_{\alpha_k} &\sim \text{Exponential}(1) \end{align*}\]
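In brms, a model with linear effects of age and wave might look something like the sketch below (the name `m2` is hypothetical; the priors mirror the equations above). Fitting it would take a while, so treat this as a specification sketch rather than code to run inline:

```r
library(brms)

# Hypothetical cumulative-logit model with linear age and wave effects
m2 <- brm(
  data = d,
  family = cumulative,
  def ~ 1 + age + wave + (1 | id),
  prior = c(prior(normal(0, 1.5), class = Intercept),
            prior(normal(0, 1), class = b),
            prior(exponential(1), class = sd)),
  iter = 5000, warmup = 1000, seed = 3, cores = 4
)
```

The `cumulative` family gives us one intercept per response threshold (the \(\alpha_k\)), while `(1 | id)` adds the partially pooled person-level offsets \(U_{kj}\).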
m3 <- brm(
  data = d,
  family = cumulative,
  bf(def ~ a + b1 * wave,
     a ~ 1 + gp(age, cov = "exp_quad", scale = FALSE) + (1 | id),
     b1 ~ 1,
     nl = TRUE),
  prior = c(prior(normal(0, 1.5), nlpar = a),
            prior(inv_gamma(2.5, 3), class = lscale, coef = gpage, nlpar = a),
            prior(exponential(1), class = sdgp, coef = gpage, nlpar = a)),
  iter = 5000, warmup = 1000, seed = 3, cores = 4,
  file = here("files/models/m91.3")
)

As a reminder, the kernel formula for our GP is
\[ k(x_i, x_j) = \sigma^2\text{exp}(-\frac{|x_i-x_j|^2}{2l^2}) \]

The value \(|x_i-x_j|^2\) is the squared distance between two values. That means the two parameters estimated in our model are \(\sigma\) and \(l\). What does each of these mean?
Let’s start with \(l\): The value of \(1/(2\times l^2)\) is equal to \(\rho\). So we can rewrite the equation as
\[ k(x_i, x_j) = \sigma^2\text{exp}(-\rho|x_i-x_j|^2) \]

In other words, the covariance between two values declines exponentially with the squared distance between them.
The remaining piece, \(\sigma^2\), is the maximum covariance between any two values.
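Putting those two pieces together with some hypothetical parameter values (\(l = 1\), \(\sigma = 1.5\), chosen only for illustration):

```r
# Hypothetical values for the two GP parameters
l     <- 1
sigma <- 1.5

rho <- 1 / (2 * l^2)  # rho = 0.5
k   <- function(dist) sigma^2 * exp(-rho * dist^2)

k(0)  # 2.25: sigma^2, the maximum covariance (identical x-values)
k(2)  # much smaller: covariance declines with squared distance
```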
I used an inverse gamma distribution as my prior for \(l\) and an exponential distribution as the prior for \(\sigma\). Here’s what that looks like:
set.seed(11)
nsim = 50
sample_l   = invgamma::rinvgamma(nsim, 2.5, 3)
sample_sig = rexp(nsim, 1)

# wrangle into functions
p1 = tibble(
  .draw = 1:nsim,
  l = sample_l,
  sig = sample_sig) %>%
  mutate(sigsq = sig^2,
         rhosq = 1 / (2 * l^2)) %>%
  expand_grid(x = seq(from = 0, to = 10, by = .05)) %>%
  mutate(covariance = sigsq * exp(-rhosq * x^2)) %>%
  # plot
  ggplot(aes(x = x, y = covariance)) +
  geom_line(aes(group = .draw),
            linewidth = 1/4, alpha = 1/4, color = "#1c5253") +
  # overlay the kernel implied by the means of the prior samples
  stat_function(fun = function(x) mean(sample_sig)^2 *
                  exp(-(1 / (2 * mean(sample_l)^2)) * x^2),
                color = "#0f393a", linewidth = 1) +
  scale_x_continuous("distance (age)", expand = c(0, 0),
                     breaks = 0:5 * 2) +
  coord_cartesian(xlim = c(0, 10),
                  ylim = c(0, 2)) +
  labs(subtitle = "Gaussian process prior")
p1

# A tibble: 16,000 × 5
.chain .iteration .draw sdgp_a_gpage lscale_a_gpage
<int> <int> <int> <dbl> <dbl>
1 1 1 1 2.75 0.933
2 1 2 2 3.13 1.26
3 1 3 3 2.63 1.10
4 1 4 4 1.42 0.870
5 1 5 5 1.29 0.655
6 1 6 6 2.10 0.963
7 1 7 7 1.65 0.479
8 1 8 8 1.91 0.692
9 1 9 9 1.52 0.816
10 1 10 10 1.63 0.514
# ℹ 15,990 more rows
Here’s our posterior distribution of the Gaussian process:
post <- as_draws_df(m3)
# for `slice_sample()`
set.seed(14)
# wrangle
p2 <-
  post %>%
  mutate(sigsq = sdgp_a_gpage^2,
         rhosq = 1 / (2 * lscale_a_gpage^2)) %>%
  slice_sample(n = 50) %>%
  expand_grid(x = seq(from = 0, to = 10, by = .05)) %>%
  mutate(covariance = sigsq * exp(-rhosq * x^2)) %>%
  # plot
  ggplot(aes(x = x, y = covariance)) +
  geom_line(aes(group = .draw),
            linewidth = 1/4, alpha = 1/4, color = "#1c5253") +
  # overlay the kernel implied by the posterior means
  stat_function(fun = function(x) mean(post$sdgp_a_gpage)^2 *
                  exp(-(1 / (2 * mean(post$lscale_a_gpage)^2)) * x^2),
                color = "#0f393a", linewidth = 1) +
  scale_x_continuous("distance (age)", expand = c(0, 0),
                     breaks = 0:5 * 2) +
  coord_cartesian(xlim = c(0, 10),
                  ylim = c(0, 2)) +
  labs(subtitle = "Gaussian process posterior")
p1 + p2